- 
                Notifications
    
You must be signed in to change notification settings  - Fork 25.6k
 
ESQL: Make field fusion generic #137382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ESQL: Make field fusion generic #137382
Conversation
Speeds up queries like ``` FROM foo | STATS SUM(LENGTH(field)) ``` by fusing the `LENGTH` into the loading of the `field` if it has doc values. Running a fairly simple test: https://gist.github.com/nik9000/9dac067f8ce29875a4fb0f0359a75091 I'm seeing that query drop from 48ms to 28ms. So, like, 40% faster. More importantly, this makes the mechanism for fusing functions into field loading generic. All you have to do is implement `BlockLoaderExpression` on your expression and return non-null from `tryFuse`.
| 
           Hi @nik9000, I've created a changelog YAML for you.  | 
    
| * "fusing" the expression into the load. Or null if the fusion isn't possible. | ||
| */ | ||
| @Nullable | ||
| Fuse tryFuse(SearchStats stats); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to find another name - we already have Fuse as a command. ExpressionFieldLoader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is FusedExpression ok? Or still too indicative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming... 😅
I come from staring at FUSE enough that it carries a lot of weight.
For me, this feature involves BlockLoaders. And Expressions that are applied to them. I understand that fuse means getting together those two, but it's not something I would think of immediately without more context.
I'd prefer to be overly explicit here, and call this BlockLoaderExpression or something similar that helps me bridge those two concepts together. But, naming...
| BlockLoaderExpression.Fuse fuse | ||
| ) { | ||
| // Only replace if exactly one side is a literal and the other a field attribute | ||
| if ((similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) == false) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! It's much better to let the Expression deal with the details and make this generic 👍
| */ | ||
| public boolean pushable() { | ||
| return true; | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bothers me. I needed this because without it we'd try to push this:
FROM foo
| WHERE LENGTH(kwd) < 10
to the index. Now, we might be able to do that with a specialized lucene query. But we don't have one of those. Without those change instead what happens is:
LENGTH(kwd)becomes$$kwd$length$hash$.- We identify 
$$kwd$length$hash$ < 10as pushable. 
This tells us we can't push it. But it's kind of picky. If SearchStats took EsField it could check this easy enough. That might be a good solution to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MultiTypeEsField is created with aggregatable=false, so that predicates on it don't get pushed down incorrectly.
Adding pushable should also work.
        
          
                ...sql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalLogicalPlanOptimizerTests.java
          
            Show resolved
            Hide resolved
        
      Adds special purpose `BlockLoader` implementations for the `MV_MIN` and `MV_MAX` functions for `keyword` fields with doc values. These are a noop for single valued keywords but should be *much* faster for multivalued keywords. These aren't plugged in yet. We can plug them in and performance test them in elastic#137382. And they give us two more functions we can use to demonstrate elastic#137382.
| } | ||
| 
               | 
          ||
| public void testLengthInWhereAndEval() { | ||
| assumeFalse("fix me", true); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QL friends: This one looks fun!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that we get duplicated reference attributes here is that when PushExpressionsToFieldLoad creates a new FunctionEsField in EsRelation, it was generated under a specific command context, and it doesn't look at the the whole query plan level. So when the same LENGTH(last_name) is referenced in multiple commands in the query, duplicated FunctionEsFields are added into EsRelation.
ResolveUnionTypes has a very similar workflow. It iterates through the entire query plan to prepare the attributes added into EsRelation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++, I'm rewriting this to look more like ResolveUnionTypes in #137392
| 
           Pinging @elastic/es-analytical-engine (Team:Analytics)  | 
    
| 
           This is ready for folks to look at! I left two open questions around planning as self-comments.  | 
    
| 
           I think I'd also like to write a really basic integration test that uses proves the fusion is happening.  | 
    
| 
           A little test: https://gist.github.com/nik9000/3a8eb7065d20c6f36d0c71fad80d1d2c With few unique values the perf goes from ~85ms to ~13ms. If there's enough values to trigger #137217 (comment) then it's slower. So, as always, @dnhatn was right. I'll make the sort, iter, uniq bit he suggested.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 💯
Some minor nits on naming (I'd really like to stay away from another FUSE word).
| * "fusing" the expression into the load. Or null if the fusion isn't possible. | ||
| */ | ||
| @Nullable | ||
| FusedBlockLoaderExpression tryFuse(SearchStats stats); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| FusedBlockLoaderExpression tryFuse(SearchStats stats); | |
| FusedBlockLoaderExpression getFusedBlockLoaderExpression(SearchStats stats); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've swapped this to tryPushToFieldLoading.
        
          
                ...lugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/vector/L2Norm.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | public class FuseExpressionToFieldLoad extends OptimizerRules.ParameterizedOptimizerRule<LogicalPlan, LocalLogicalOptimizerContext> { | ||
| 
               | 
          ||
| public PushDownVectorSimilarityFunctions() { | ||
| public FuseExpressionToFieldLoad() { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public FuseExpressionToFieldLoad() { | |
| public BlockLoaderExpressionToFieldLoad() { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to use PushExpressionsToFieldLoad.
| import org.elasticsearch.compute.operator.Warnings; | ||
| import org.elasticsearch.xpack.esql.core.tree.Source; | ||
| 
               | 
          ||
| public class BlockLoaderWarnings implements org.elasticsearch.index.mapper.blockloader.Warnings { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess these are warnings that can be created when using BlockLoader function config to load values. Should we add that to the javadoc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
…expression/function/vector/L2Norm.java Co-authored-by: Carlos Delgado <[email protected]>
| } | ||
| 
               | 
          ||
| public void testLengthInWhereAndEval() { | ||
| assumeFalse("fix me", true); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that we get duplicated reference attributes here is that when PushExpressionsToFieldLoad creates a new FunctionEsField in EsRelation, it was generated under a specific command context, and it doesn't look at the the whole query plan level. So when the same LENGTH(last_name) is referenced in multiple commands in the query, duplicated FunctionEsFields are added into EsRelation.
ResolveUnionTypes has a very similar workflow. It iterates through the entire query plan to prepare the attributes added into EsRelation
| 
               | 
          ||
| public void testLengthInWhereAndEval() { | ||
| assumeFalse("fix me", true); | ||
| assumeTrue("requires similarity functions", EsqlCapabilities.Cap.VECTOR_SIMILARITY_FUNCTIONS_PUSHDOWN.isEnabled()); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems work! That is an old capability.
| ); | ||
| var name = rawTemporaryName(fieldAttr.name(), similarityFunction.nodeName(), String.valueOf(arrayHashCode)); | ||
| FunctionEsField functionEsField = new FunctionEsField(fuse.field().field(), e.dataType(), fuse.config()); | ||
| var name = rawTemporaryName(fuse.field().name(), fuse.config().name(), String.valueOf(fuse.config().hashCode())); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason that we need a hash code in the new attribute's name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original assumption is that we needed the name to be the same for attributes that had the same field name and similarity vector, so I originally included the vector hash as part of the name.
Alex told me that this is not necessary, so I'll remove that in #137392.
Speeds up queries like
by fusing the
LENGTHinto the loading of thefieldif it has doc values. Running a fairly simple test:https://gist.github.com/nik9000/9dac067f8ce29875a4fb0f0359a75091 I'm seeing that query drop from 48ms to 28ms. So, like, 40% faster.
More importantly, this makes the mechanism for fusing functions into field loading generic. All you have to do is implement
BlockLoaderExpressionon your expression and return non-null fromtryFuse.